AdaFactor: avoid updating group["lr"] attributes #9751
Conversation
This affects Adafactor with relative_step=False and scale_parameter=True. Updating group["lr"] makes the result of ._get_lr() depend on the previous call, i.e., on the scale of other parameters. This isn't supposed to happen.
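To make the coupling concrete, here is a simplified, self-contained sketch of the pattern in question (not the actual transformers implementation; the `eps` and `RMS` fields below just mimic the optimizer's per-parameter state):

```python
# Simplified sketch (NOT the actual transformers code) of why writing the
# result of _get_lr() back into group["lr"] couples parameters: with
# relative_step=False and scale_parameter=True, each call multiplies the
# stored lr by a per-parameter scale, so later calls compound earlier ones.
def _get_lr(group, param_state):
    lr = group["lr"]
    if group["scale_parameter"]:
        lr *= max(group["eps"][1], param_state["RMS"])  # per-parameter RMS scaling
    return lr

group = {"lr": 1e-3, "scale_parameter": True, "eps": (1e-30, 1e-3)}
states = [{"RMS": 10.0}, {"RMS": 0.5}]  # two parameters with very different scales

for state in states:
    group["lr"] = _get_lr(group, state)  # the problematic in-place update
    print(group["lr"])  # 0.01, then 0.005 -- the second lr inherits the first parameter's scale
```

With independent calls, the second parameter's lr would be 0.0005; because the group entry was mutated, it comes out as 0.005.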
Can you provide evidence that supports the following:
Thanks!
Hi, thanks for the quick reply. This is taken from the AdaFactor paper (the step-size definitions are reproduced below): as you can see, ρ only depends on the step number if we use relative steps. And if we switch to any other learning rate schedule (in my case, linear warmup + cosine decay), it doesn't make sense to make the ρ part depend on the scale of the other parameters, nor can I find any reference to this approach in the paper. If we (loosely) factor the α_t in the original implementation into per-parameter terms α_{i,t}, each α_{i,t} should depend only on that parameter's own RMS and the step number t, not on the other parameters.
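For reference, the step-size definitions being referred to (reproduced from the AdaFactor paper, Shazeer & Stern 2018; this is a reconstruction, not the original figure):

```latex
% Relative step size: depends only on the step number t.
\rho_t = \min\!\left(10^{-2},\ \tfrac{1}{\sqrt{t}}\right)
% Absolute step size: scales \rho_t by the RMS of the parameter being updated.
\alpha_t = \max\!\left(\epsilon_2,\ \mathrm{RMS}(X_{t-1})\right)\rho_t
```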
I should probably clarify what I meant by "weird behaviors": the model (T5 v1.1) never converged when trained with Adafactor in this configuration (relative_step=False, scale_parameter=True, plus an external LR scheduler).
cc @patrickvonplaten @patil-suraj
I agree very much with your explanation here @ceshine - that's a great fix, thanks!
BTW, if you have some working code for how to train a google/t5v1_1 model, I think it would be super helpful to post it here, on the forum, or as a community notebook! Many people have been asking for good t5v1_1 training scripts :-)
Looks good. Thank you!
It doesn't look like any other entry in group gets modified.
Ideally, a situation like this is a great opportunity to add a test that detects the problem - i.e., lack of convergence - though I can imagine that would be quite tricky to accomplish!
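A much simpler regression test than a convergence test could just assert that step() leaves group["lr"] untouched. A sketch, assuming the transformers Adafactor API; the test name and hyperparameters are hypothetical:

```python
import torch
from transformers.optimization import Adafactor

def test_step_does_not_mutate_group_lr():
    # With relative_step=False and scale_parameter=True, stepping over
    # parameters of very different scales must not change the lr stored
    # in the param group.
    params = [
        torch.nn.Parameter(torch.randn(4, 4) * 100.0),
        torch.nn.Parameter(torch.randn(4, 4) * 0.01),
    ]
    opt = Adafactor(
        params, lr=1e-3, relative_step=False, scale_parameter=True, warmup_init=False
    )
    for p in params:
        p.grad = torch.ones_like(p)
    opt.step()
    assert opt.param_groups[0]["lr"] == 1e-3
```

Before this patch the assertion would fail because group["lr"] gets overwritten with the scaled value; after it, the stored lr is left alone.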
Thank you for your explanation and for providing references. LGTM.
Thank you all for your time and for accepting the patch! Glad to have made a tiny contribution to this great library.
I don't have anything that is sufficiently readable yet. Nonetheless, I have these notebooks published on Kaggle that use the patched Adafactor: one for T5 v1.1 and one for mT5. They are based on this GitHub repo, which is quite messy at the moment. The part that sets up the optimizer is located here.
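For readers who just want the gist of that setup, here is a minimal sketch of pairing Adafactor (relative_step=False, scale_parameter=True) with a linear-warmup + cosine-decay schedule. The model and hyperparameters are placeholders, not the notebooks' actual values:

```python
import torch
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup

# Stand-in model; the notebooks train T5 v1.1 / mT5 instead.
model = torch.nn.Linear(8, 8)

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,               # base lr, to be shaped by the external scheduler
    relative_step=False,   # disable Adafactor's internal schedule
    scale_parameter=True,  # keep per-parameter RMS scaling
    warmup_init=False,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
```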
What does this PR do?
I've observed weird behaviors when using Adafactor with relative_step=False and scale_parameter=True together with an LR scheduler. I think the problem is that the code updates the lr attribute of the current parameter group and then uses the updated attribute to calculate the learning rate for the next parameter. I don't think this is supposed to happen. A simple fix would be replacing the update operation with an assignment to a local variable.
I'm not entirely sure if I understand the problem correctly, so I apologize in advance if this is a stupid PR. I'd appreciate it if someone could point out where I am wrong. Thanks!
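Schematically, the fix keeps the computed learning rate in a local variable instead of writing it back into the group (same simplified sketch as above, not the actual transformers code):

```python
# With the fix, group["lr"] is never mutated, so each parameter's lr depends
# only on the base lr and that parameter's own RMS.
group = {"lr": 1e-3, "scale_parameter": True, "eps": (1e-30, 1e-3)}
states = [{"RMS": 10.0}, {"RMS": 0.5}]

for state in states:
    lr = group["lr"]
    if group["scale_parameter"]:
        lr *= max(group["eps"][1], state["RMS"])
    print(lr)  # 0.01, then 0.0005; group["lr"] stays at 1e-3
```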
Before submitting
- Did you read the contributor guidelines, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@moscow25 @sshleifer